[DistInf] Enable SGLang disagg with MoRI-io & MoRI-EP for all existing models#154
[DistInf] Enable SGLang disagg with MoRI-io & MoRI-EP for all existing models#154lcskrishna wants to merge 1 commit intoROCm:developfrom
Conversation
There was a problem hiding this comment.
Pull request overview
Refactors the SGLang disaggregated (prefill/decode) launch flow to support MoRI-IO + MoRI-EP across existing models, adding a DP-vs-TP switch and a YAML-based model flag catalog.
Changes:
- Added
sglang_disagg_mori_io_ep.sh(MoRI transfer backend launcher) withDP_MODEsupport and YAML-driven per-model flags. - Updated
run_xPyD_models.slurmto optionally route runs through the MoRI launcher (RUN_MORI=1), add DP-mode validation, and pass additional env knobs. - Added MoRI environment defaults (
mori_ep_env.sh) and a model catalog (models.yaml); updated the disagg inference Dockerfile for a newer base and NIC backend selection.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
| scripts/sglang_disagg/sglang_disagg_mori_io_ep.sh | New MoRI-based disaggregated launcher with DP/TP mode selection, log-based readiness gating, and YAML model configuration. |
| scripts/sglang_disagg/run_xPyD_models.slurm | Adds MoRI entrypoint selection, DP-mode allowlist validation, and passes new MoRI-related env vars into containers. |
| scripts/sglang_disagg/mori_ep_env.sh | Introduces MoRI/SGLang multi-node environment defaults and DP-mode tuning knobs. |
| scripts/sglang_disagg/models.yaml | Centralizes per-model CLI flags for TP vs DP launch modes. |
| docker/sglang_disagg_inference.ubuntu.amd.Dockerfile | Updates base image and adds NIC backend build-time selection logic for MoRI. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| _run_mori="${RUN_MORI:-0}" | ||
| if [[ "$_run_mori" == "1" ]]; then | ||
| if model_allows_mori_ep "$MODEL_NAME"; then | ||
| RUN_FILE="sglang_disagg_mori_io_ep.sh" | ||
| echo "RUN_MORI=1: using $RUN_FILE for model '$MODEL_NAME'" | ||
| else | ||
| echo "Error: RUN_MORI=1 but MODEL_NAME '$MODEL_NAME' is not in MORI_EP_VALID_MODELS" >&2 | ||
| printf "MoRI EP allowed models:\n" >&2 | ||
| for m in "${MORI_EP_VALID_MODELS[@]}"; do | ||
| printf " - %s\n" "$m" >&2 | ||
| done | ||
| exit 1 | ||
| fi | ||
| if [[ ! -f "$RUN_FILE" ]]; then | ||
| echo "Error: RUN_MORI=1 requires '$RUN_FILE' in $(pwd)." >&2 | ||
| exit 1 | ||
| fi | ||
| else | ||
| echo "Error: Model '$MODEL_NAME' not found in MODEL_RUNFILES" | ||
| echo "Available models: ${!MODEL_RUNFILES[@]}" | ||
| exit 1 | ||
| RUN_FILE="sglang_disagg_server.sh" | ||
| echo "RUN_MORI not set: using $RUN_FILE" | ||
| fi |
There was a problem hiding this comment.
PR description says RUN_MORI “will be added as default”, but here _run_mori defaults to 0 and the script explicitly logs “RUN_MORI not set”. If the intended behavior is to enable MoRI by default, consider flipping the default to RUN_MORI=1 (or clarify the PR description / add a TODO noting it’s intentionally not default yet).
| if [[ ! -f "$RUN_FILE" ]]; then | ||
| echo "Error: RUN_MORI=1 requires '$RUN_FILE' in $(pwd)." >&2 | ||
| exit 1 | ||
| fi |
There was a problem hiding this comment.
RUN_FILE is checked as a relative path (-f "$RUN_FILE") and later used to build RUN_FILE_FULL based on MOONCAKE_REPO_DIR=$(pwd). In Slurm jobs the working directory depends on where sbatch was invoked, so submitting from outside this folder can break unexpectedly. Consider resolving paths relative to the script location (e.g., SCRIPT_DIR=$(cd "$(dirname "$0")" && pwd)) and basing MOONCAKE_REPO_DIR / RUN_FILE_FULL on that.
| # Define valid model names | ||
| VALID_MODELS=( \ | ||
| "Qwen3-32B" \ | ||
| "Mixtral-8x7B-v0.1" \ | ||
| "Mixtral-8x7B-Instruct-v0.1" \ | ||
| "Llama-3.1-8B-Instruct" \ | ||
| "Llama-3.1-405B-Instruct-FP8-KV" \ | ||
| "amd-Llama-3.3-70B-Instruct-FP8-KV" \ | ||
| "DeepSeek-V3" \ | ||
| "DeepSeek-R1" \ | ||
| ) |
There was a problem hiding this comment.
Model naming is now Mixtral-8x7B-Instruct-v0.1 in VALID_MODELS, but other launch helpers in this directory still reference Mixtral-8x7B-v0.1 (e.g., scripts/sglang_disagg/salloc_launch.sh:8 and scripts/sglang_disagg/sglang_disagg_server.sh model config maps). This inconsistency will cause validation failures or unexpected defaults depending on which entrypoint is used; consider updating the other scripts or supporting both aliases here.
| set -x | ||
| eval "$ROUTER_CMD" 2>&1 | tee "/run_logs/${SLURM_JOB_ID:-0}/proxy_NODE${NODE_RANK}.log" >/dev/null & | ||
| set +x | ||
| proxy_pid=$! | ||
| echo "Router (sglang_router) started pid=${proxy_pid} (DP_MODE=${DP_MODE})" |
There was a problem hiding this comment.
proxy_pid=$! here captures the PID of tee (the last process in the backgrounded pipeline), not the python3 -m sglang_router.launch_router process. As a result, kill "${proxy_pid}" may not actually stop the router, leaving it running and keeping port 2322 bound. Prefer launching without a pipe (redirect directly to a log file), or use a process-group/job-control approach so you can reliably terminate the router process.
| set -x | |
| eval "$ROUTER_CMD" 2>&1 | tee "/run_logs/${SLURM_JOB_ID:-0}/proxy_NODE${NODE_RANK}.log" >/dev/null & | |
| set +x | |
| proxy_pid=$! | |
| echo "Router (sglang_router) started pid=${proxy_pid} (DP_MODE=${DP_MODE})" | |
| proxy_log="/run_logs/${SLURM_JOB_ID:-0}/proxy_NODE${NODE_RANK}.log" | |
| set -x | |
| bash -lc 'exec ${ROUTER_CMD}' >"${proxy_log}" 2>&1 & | |
| set +x | |
| proxy_pid=$! | |
| echo "Router (sglang_router) started pid=${proxy_pid} (DP_MODE=${DP_MODE}, log=${proxy_log})" |
| echo "" | ||
| } | tee "$PREFILL_LOG" | ||
| set -x | ||
| eval "$PREFILL_CMD" 2>&1 | tee -a "$PREFILL_LOG" >/dev/null & |
There was a problem hiding this comment.
Same pipeline PID issue as the router: prefill_pid=$! is the PID of tee, not the actual sglang.launch_server process, so the subsequent kill "${prefill_pid}" may not stop the server. Consider redirecting logs without tee, or track/kill the real server PID (e.g., avoid pipelines or kill the process group).
| eval "$PREFILL_CMD" 2>&1 | tee -a "$PREFILL_LOG" >/dev/null & | |
| eval "$PREFILL_CMD" >>"$PREFILL_LOG" 2>&1 & |
| git clone --branch v62.0 --depth 1 https://github.com/linux-rdma/rdma-core.git /tmp/rdma-core && \ | ||
| cd /tmp/rdma-core && \ | ||
| mkdir -p build && cd build && \ | ||
| cmake -GNinja -DCMAKE_INSTALL_PREFIX=/usr -DNO_MAN_PAGES=1 .. && \ | ||
| ninja && ninja install && ldconfig && rm -rf /tmp/rdma-core |
There was a problem hiding this comment.
The rdma-core source clone/build/install runs unconditionally on every launch. This adds significant startup time and introduces a hard dependency on outbound network access and build tooling inside the container; if GitHub is unavailable the entire job will fail. Consider moving this into the Docker image (preferred), or gating it behind an opt-in env var and/or a “already installed” check before rebuilding.
| git clone --branch v62.0 --depth 1 https://github.com/linux-rdma/rdma-core.git /tmp/rdma-core && \ | |
| cd /tmp/rdma-core && \ | |
| mkdir -p build && cd build && \ | |
| cmake -GNinja -DCMAKE_INSTALL_PREFIX=/usr -DNO_MAN_PAGES=1 .. && \ | |
| ninja && ninja install && ldconfig && rm -rf /tmp/rdma-core | |
| ## Prefer baking this into the image; runtime install is opt-in to avoid | |
| ## unconditional startup cost and hard dependency on outbound network access. | |
| INSTALL_RDMA_CORE_AT_RUNTIME="${INSTALL_RDMA_CORE_AT_RUNTIME:-0}" | |
| if command -v ibv_devinfo >/dev/null 2>&1; then | |
| echo "RDMA userspace already present; skipping runtime rdma-core install." | |
| elif [[ "${INSTALL_RDMA_CORE_AT_RUNTIME}" == "1" ]]; then | |
| git clone --branch v62.0 --depth 1 https://github.com/linux-rdma/rdma-core.git /tmp/rdma-core && \ | |
| cd /tmp/rdma-core && \ | |
| mkdir -p build && cd build && \ | |
| cmake -GNinja -DCMAKE_INSTALL_PREFIX=/usr -DNO_MAN_PAGES=1 .. && \ | |
| ninja && ninja install && ldconfig && rm -rf /tmp/rdma-core | |
| if ! command -v ibv_devinfo >/dev/null 2>&1; then | |
| echo "ERROR: INSTALL_RDMA_CORE_AT_RUNTIME=1 was set, but rdma-core install did not complete successfully." >&2 | |
| exit 1 | |
| fi | |
| else | |
| echo "Skipping runtime rdma-core install (set INSTALL_RDMA_CORE_AT_RUNTIME=1 to enable this temporary workaround)." | |
| fi |
| NODE_RANK="${NODE_RANK:-0}" | ||
| MODEL_PATH=$MODEL_PATH | ||
| MODEL_NAME="${MODEL_NAME:-}" | ||
| xP="${xP:-1}" | ||
| yD="${yD:-1}" | ||
| DP_MODE="${DP_MODE:-0}" | ||
| # PARALLEL_MODE is derived strictly from DP_MODE for models.yaml (tp vs dp flags). | ||
| if [[ "$DP_MODE" == "1" ]]; then | ||
| PARALLEL_MODE=dp | ||
| else | ||
| PARALLEL_MODE=tp | ||
| fi | ||
| echo "PARALLEL_MODE=${PARALLEL_MODE} (DP_MODE=${DP_MODE})" | ||
|
|
||
| if [[ -z "${MODEL_NAME:-}" ]]; then | ||
| echo "ERROR: MODEL_NAME not set, exiting" | ||
| exit 1 | ||
| fi |
There was a problem hiding this comment.
MODEL_PATH=$MODEL_PATH will silently set MODEL_PATH to an empty string if it wasn't provided, and the script never validates that it is non-empty before using --model-path ${MODEL_PATH}. Add an explicit check (similar to MODEL_NAME) and use a safe default/assignment (e.g., MODEL_PATH="${MODEL_PATH:-}") so misconfiguration fails fast with a clear error.
| RUN pip install --upgrade sglang-router | ||
|
|
||
| # the default already installs mori with AINIC. Reinstall from source for cx7 and bnxt; ainic only clears USE_* overrides. | ||
| ARG NIC_BACKEND="cx7" |
There was a problem hiding this comment.
In a Dockerfile, ARG default values are taken literally (quotes are not stripped). With ARG NIC_BACKEND="cx7", the value becomes "cx7", so the later case "${NIC_BACKEND}" in cx7) will not match and will fall into the error branch. Use ARG NIC_BACKEND=cx7 (no quotes), and document quoting in the docker build --build-arg example if needed.
| ARG NIC_BACKEND="cx7" | |
| ARG NIC_BACKEND=cx7 |
| set -x | ||
| eval "$DECODE_CMD" 2>&1 | tee -a "$DECODE_LOG" >/dev/null & | ||
| set +x | ||
| decode_pid=$! | ||
|
|
There was a problem hiding this comment.
Same pipeline PID issue as the router/prefill: decode_pid=$! captures tee's PID, not the sglang.launch_server PID, so kill "${decode_pid}" may fail to stop the decode server cleanly. Recommend launching the server without a pipeline or killing the process group so the real server process is terminated.
| export MODELS_YAML MODEL_NAME PARALLEL_MODE | ||
| eval "$(python3 - <<'PY' | ||
| import os | ||
| import shlex | ||
| import sys | ||
| import yaml | ||
|
|
||
| config_path = os.environ["MODELS_YAML"] | ||
| model_name = os.environ["MODEL_NAME"] | ||
| mode = os.environ["PARALLEL_MODE"] | ||
|
|
||
| with open(config_path, "r", encoding="utf-8") as f: | ||
| models = yaml.safe_load(f) or {} | ||
|
|
||
| if model_name not in models: | ||
| print(f'echo "ERROR: Model {model_name} not found in {config_path}"; exit 1') | ||
| sys.exit(0) | ||
|
|
||
| cfg = models[model_name] or {} | ||
| prefill = cfg.get("prefill", {}) or {} | ||
| decode = cfg.get("decode", {}) or {} | ||
|
|
||
|
|
||
| def q(v): | ||
| return shlex.quote(str(v if v is not None else "")) | ||
|
|
||
|
|
||
| exports = { | ||
| "MODEL_BASE_FLAGS": cfg.get("base_flags", ""), | ||
| "MODEL_MODE_FLAGS": cfg.get(f"{mode}_flags", ""), | ||
| "MODEL_PREFILL_FLAGS": prefill.get(mode, ""), | ||
| "MODEL_DECODE_FLAGS": decode.get(mode, ""), | ||
| "MODEL_EXPERIMENTAL_FLAGS": cfg.get("experimental_flags", ""), | ||
| } | ||
|
|
||
| for key, value in exports.items(): | ||
| print(f"{key}={q(value)}") | ||
| PY | ||
| )" | ||
|
|
||
| PREFILL_MODEL_CONFIG="${MODEL_BASE_FLAGS} ${MODEL_MODE_FLAGS} ${MODEL_PREFILL_FLAGS} ${MODEL_EXPERIMENTAL_FLAGS}" | ||
| DECODE_MODEL_CONFIG="${MODEL_BASE_FLAGS} ${MODEL_MODE_FLAGS} ${MODEL_DECODE_FLAGS} ${MODEL_EXPERIMENTAL_FLAGS}" |
There was a problem hiding this comment.
This uses eval to ingest YAML-provided flag strings and later eval again to execute the assembled command. Because the YAML values can include shell expansions (and the current YAML intentionally contains $(seq ...)), this effectively allows arbitrary shell execution via models.yaml. If you only need list expansion for flags like --cuda-graph-bs, prefer representing them as explicit space-separated values in YAML (no $(...)) and build the python3 argv as a Bash array to avoid eval entirely.
Motivation
This PR aims to refactor MoRI IO & MoRI EP solution to SGLang disagg models.
Summary:
Test Plan
Submission Checklist